On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors

Authors

Khairul Kabir

Azzam Haidar

Stanimire Tomov

Jack J. Dongarra

Abstract

The dramatic change in computer architecture due to the manycore paradigm shift has made the development of optimal numerical routines extremely challenging. In this work, we target the development of numerical algorithms and implementations for the Xeon Phi coprocessor architecture. In particular, we examine and optimize the general and symmetric matrix-vector multiplication routines (gemv/symv), which are among the most heavily used linear algebra kernels in many important engineering and physics applications. We describe a successful approach to addressing the challenges of this problem, from algorithm design, performance analysis, and programming model to kernel optimization. Our goal, by targeting low-level, easy-to-understand fundamental kernels, is to develop new optimization strategies that can be effective elsewhere on manycore coprocessors, and to show significant performance improvements over existing state-of-the-art implementations. Therefore, in addition to the new optimization strategies, analysis, and optimal performance results, we present the significance of using these routines and strategies to accelerate higher-level numerical algorithms for the eigenvalue problem (EVP) and the singular value decomposition (SVD), which are themselves foundational for many important applications.
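For readers unfamiliar with the two kernels named in the abstract, the following is a minimal reference sketch of what gemv and symv compute, following the standard BLAS definition y := alpha*A*x + beta*y. It is not the authors' optimized Xeon Phi implementation; the function names `gemv` and `symv_lower`, the list-of-rows matrix layout, and the choice to read only the lower triangle are assumptions made for illustration.

```python
def gemv(alpha, a, x, beta, y):
    """Return alpha * A @ x + beta * y for a dense matrix A given as a list of rows."""
    return [
        alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
        for row, y_i in zip(a, y)
    ]


def symv_lower(alpha, a, x, beta, y):
    """Return alpha * A @ x + beta * y, reading only the lower triangle of symmetric A."""
    n = len(x)
    out = [beta * y_i for y_i in y]
    for i in range(n):
        # The diagonal entry contributes once.
        out[i] += alpha * a[i][i] * x[i]
        for j in range(i):
            # Each stored off-diagonal entry a[i][j] (j < i) is used twice:
            # once as A[i][j] and once as its mirror image A[j][i].
            out[i] += alpha * a[i][j] * x[j]
            out[j] += alpha * a[i][j] * x[i]
    return out
```

The symv sketch shows why the symmetric case is interesting for optimization: each off-diagonal element is loaded once but contributes to two output entries, so roughly half of the matrix needs to be read compared with gemv.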


Similar resources

A New Parallel Matrix Multiplication Method Adapted on Fibonacci Hypercube Structure

The objective of this study was to develop a new optimal parallel algorithm for matrix multiplication that could run on a Fibonacci Hypercube structure. Most of the popular algorithms for parallel matrix multiplication cannot run on a Fibonacci Hypercube structure; therefore, giving a method that can be run on all structures, especially the Fibonacci Hypercube structure, is necessary for parallel matr...


The Basics of Performance Libraries for Embedded Systems

With each successive generation of processors, the number of developers who want to program in assembly language is diminishing. Further, every generation has required a new round of assemblers, compilers, and a new set of developer expertise. Tools vendors recognized a need to provide developers with software tools to support every processor release with a variety of functionality, a storehous...


Accelerating GPU Kernels for Dense Linear Algebra

Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major building blocks of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU specific optimiz...


Subdivision Surface Evaluation as Sparse Matrix-Vector Multiplication

We present an interpretation of subdivision surface evaluation in the language of linear algebra. Specifically, the vector of surface points can be computed by left-multiplying the vector of control points by a sparse subdivision matrix. This “matrix-driven” interpretation applies to any level of subdivision, holds for many common subdivision schemes (including Catmull-Clark and Loop), supports...


Autotuning Divide-and-Conquer Matrix-Vector Multiplication

Divide and conquer is an important concept in computer science. It is used ubiquitously to simplify and speed up programs. However, it needs to be optimized, with respect to parameter settings for example, in order to achieve the best performance. The problem boils down to searching for the best implementation choice on a given set of requirements, such as which machine the program is running o...



Publication date: 2015
